Abstract

The idea of measuring distance between languages seems to have its roots in 
the work of the French explorer Dumont D'Urville (2). He collected comparative
word lists of various languages during his voyages aboard the Astrolabe
from 1826 to 1829 and, in his work about the geographical division of the 
Pacific, he proposed a method to measure the degree of relation among lan- 
guages. The method used by modern glottochronology, developed by Morris 
Swadesh in the 1950s, measures distances from the percentage of shared cog- 
nates, which are words with a common historical origin. Recently, we pro- 
posed a new automated method which uses normalized Levenshtein distance 
among words with the same meaning and averages on the words contained 
in a list. Another group of scholars (jl|; Q) has since proposed a refinement of
our definition that includes a second normalization. In this paper we compare
the information content of our definition with the refined version in order 
to decide which of the two can be applied with greater success to resolve 
relationships among languages. 



1. Introduction 

Glottochronology tries to estimate the time at which languages diverged 
with the implicit assumption that vocabularies change at a constant average 
rate. The idea is to consider the percentage of shared cognates in order to 
compute the distance between pairs of languages (llll ). These lexical distances 
are assumed to be, on average, logarithmically proportional to divergence
times. In fact, changes in vocabulary accumulate year after year and two 
languages initially similar become more and more different. Recent examples
of the use of Swadesh lists and cognates to construct language trees are the
studies of Gray and Atkinson (4) and Gray and Jordan (0).

We recently proposed an automated method which uses Levenshtein dis- 
tance among words in a list (0; H). To be precise, we defined the lexical 
distance of two languages by considering a normalized Levenshtein distance 
among words with the same meaning and averaging over all the words
contained in a Swadesh list. The normalization is extremely important and no
reasonable results can be obtained without it. Then, we transformed the
lexical distances into separation times. This goal was reached by a logarithmic
rule which is the analogue of the adjusted fundamental formula of
glottochronology (fiol ). Finally, the phylogenetic tree could be
straightforwardly constructed.

In (j^l; Q) we tested our method by constructing the phylogenetic trees of 
the Indo-European and the Austronesian groups. 

Almost at the same time, the automated method described above was
used and developed by another large group of scholars (0; Q). They placed
the method at the core of an ambitious project, the ASJP (Automated
Similarity Judgment Program). In their work they proposed a refinement of
our definition which includes a second normalization in the definition of
lexical distance.

The goal of this paper is to compare the information content of the two 
definitions in order to decide which of the two can be applied with greater 
success to resolve relationships among languages. 

Before tackling this problem we sketch our definition of lexical distance 
and the modification proposed in (jl|; 0) which is a refinement including a 
second normalization. Then we compare the information content of the two 
definitions and give our conclusion. 

2. Lexical distance 

Our definition of lexical distance between two words is a variant of the 
Levenshtein distance which is simply the minimum number of insertions, 
deletions, or substitutions of a single character needed to transform one word 
into the other. Our definition is taken as the Levenshtein distance divided 
by the number of characters of the longer of the two compared words. More
precisely, given two words $\alpha_i$ and $\beta_j$, their lexical distance
$D(\alpha_i, \beta_j)$ is given by

$$D(\alpha_i, \beta_j) = \frac{D_l(\alpha_i, \beta_j)}{L(\alpha_i, \beta_j)} \qquad (1)$$

where $D_l(\alpha_i, \beta_j)$ is the Levenshtein distance between the two words and
$L(\alpha_i, \beta_j)$ is the number of characters of the longer of the two words
$\alpha_i$ and $\beta_j$. Therefore, the distance can take any value between 0 and 1.
Obviously $D(\alpha_i, \alpha_i) = 0$.

The normalization is an important novelty and it plays a crucial role; no
sensible results can be found without it (H; @).
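
As an illustration of the word distance of equation (1), the following is a minimal Python sketch. The function names `levenshtein` and `word_distance` are ours, chosen for this example, and the convention that two empty strings are at distance 0 is an assumption not discussed in the text.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions or
    substitutions needed to transform a into b."""
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def word_distance(a: str, b: str) -> float:
    """Equation (1): Levenshtein distance divided by the length of the
    longer of the two words."""
    longer = max(len(a), len(b))
    if longer == 0:
        return 0.0  # assumption: two empty strings are at distance 0
    return levenshtein(a, b) / longer
```

For instance, `word_distance("milk", "milch")` needs one substitution and one insertion over a longer word of five characters, giving 2/5 = 0.4, while identical words are at distance 0.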

We use distance between pairs of words, as defined above, to construct 
the lexical distances of languages. For any pair of languages, the first step is 
to compute the distance between words corresponding to the same meaning 
in the Swadesh list. Then, the lexical distance between each language pair
is defined as the average of the distances between all such words (@; @). As a result
we have a number between 0 and 1 which we claim to be the lexical distance
between two languages. 

Assume that the number of languages is N and the list of words for any 
language contains M items. Any language in the group is labeled by a Greek
letter (say $\alpha$) and any word of that language by $\alpha_i$ with
$1 \le i \le M$. Then, two words $\alpha_i$ and $\beta_j$ in the languages
$\alpha$ and $\beta$ have the same meaning if $i = j$.

Then the distance between two languages is 

$$D(\alpha, \beta) = \frac{1}{M} \sum_i D(\alpha_i, \beta_i) \qquad (2)$$

where the sum goes from 1 to M. Notice that only pairs of words with the same
meaning are used in this definition. This number is in the interval [0,1] and,
obviously, $D(\alpha, \alpha) = 0$.
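
The average over same-meaning word pairs in equation (2) can be sketched as follows. This is only an illustrative sketch: `language_distance` and the three-item lists are hypothetical stand-ins (real lists in this paper have M = 200 items), and the compact `word_distance` helper is repeated so that the snippet runs on its own.

```python
from typing import Sequence


def word_distance(a: str, b: str) -> float:
    """Normalized Levenshtein distance of equation (1)."""
    if not a and not b:
        return 0.0  # assumption: two empty entries are at distance 0
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1,
                               current[j - 1] + 1,
                               previous[j - 1] + (ca != cb)))
        previous = current
    return previous[-1] / max(len(a), len(b))


def language_distance(words_a: Sequence[str], words_b: Sequence[str]) -> float:
    """Equation (2): average word distance over pairs with the same meaning.
    Both lists must be aligned, i.e. index i refers to the same meaning."""
    if len(words_a) != len(words_b) or not words_a:
        raise ValueError("lists must be non-empty and aligned by meaning")
    m = len(words_a)
    return sum(word_distance(x, y) for x, y in zip(words_a, words_b)) / m


# Toy lists for illustration only; not taken from the database used in the paper.
english = ["water", "milk", "night"]
german = ["wasser", "milch", "nacht"]
print(language_distance(english, german))
```

Only the M aligned pairs enter the sum, so the cost grows linearly with the length of the list.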

The result of the analysis is an $N \times N$ upper triangular matrix whose
entries are the $N(N-1)/2$ non-trivial lexical distances $D(\alpha, \beta)$
between all pairs in a group. Our method for computing distances is a very
simple operation that does not need any specific linguistic knowledge and
requires a minimum of computing time.

A phylogenetic tree could be constructed from the matrix of lexical
distances $D(\alpha, \beta)$, but this would only give the topology of the tree
whereas the absolute time scale would be missing. Therefore, we perform (0; @) a
logarithmic transformation of lexical distances which is the analogue of the
adjusted fundamental formula of glottochronology (fiol). In this way we
obtain a new $N \times N$ upper triangular matrix whose entries are the divergence
times between all pairs of languages. This matrix preserves the topology of
the lexical distance matrix but it also contains the information concerning
absolute time scales. Then, the phylogenetic tree can be straightforwardly
constructed.

In (0; [sl) we tested our method by constructing the phylogenetic trees of
the Indo-European group and of the Austronesian group. In both cases we
considered N = 50 languages. The database (fl^) that we used in (0; [sl) is
composed of M = 200 words for each language. The main source for the
database for the Indo-European group is the file prepared by Dyen et al.
in (Q). For the Austronesian group we used as the main source the lists
contained in the huge database in (p).



3. A second normalization 

A further modification has been proposed in ([]]; 0) in order to avoid
possible similarity which could arise from accidental orthographical
similarity of languages.

Let us first define the global distance between languages $\alpha$ and $\beta$ as

$$\Gamma(\alpha, \beta) = \frac{1}{M(M-1)} \sum_{i \neq j} D(\alpha_i, \beta_j) \qquad (3)$$

where the sum runs over all $M(M-1)$ pairs of words corresponding to different
meanings in the two lists ($M^2$ is the total number of pairs and $M$ is the
number of pairs with the same meaning).

This quantity measures a distance between the vocabularies of the two languages,
without comparing words with the same meaning. In other words, it only accounts
for general similarities in the frequency and ordering of characters. The point
is that, at this stage, we do not know whether $\Gamma(\alpha, \beta)$ carries
information or only depends on accidental similarities.
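
A minimal sketch of the global distance of equation (3) is given below. It averages the word distance over all ordered pairs of different meanings. The name `global_distance` and the toy lists are hypothetical, and the `word_distance` helper is repeated so that the snippet is self-contained.

```python
from typing import Sequence


def word_distance(a: str, b: str) -> float:
    """Normalized Levenshtein distance of equation (1)."""
    if not a and not b:
        return 0.0  # assumption: two empty entries are at distance 0
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1,
                               current[j - 1] + 1,
                               previous[j - 1] + (ca != cb)))
        previous = current
    return previous[-1] / max(len(a), len(b))


def global_distance(words_a: Sequence[str], words_b: Sequence[str]) -> float:
    """Equation (3): average word distance over the M(M-1) pairs (i, j)
    with i != j, i.e. over words with different meanings."""
    m = len(words_a)
    if m != len(words_b) or m < 2:
        raise ValueError("need two aligned lists with at least two items")
    total = sum(word_distance(words_a[i], words_b[j])
                for i in range(m) for j in range(m) if i != j)
    return total / (m * (m - 1))


# Toy lists for illustration only; not taken from the database used in the paper.
english = ["water", "milk", "night"]
german = ["wasser", "milch", "nacht"]
print(global_distance(english, german))   # two different languages
print(global_distance(english, english))  # self-distance, generally non-zero
```

Unlike the lexical distance, this quantity requires of the order of $M^2$ word comparisons per language pair, and it does not vanish when a language is compared with itself.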

Assuming the second point of view, it is reasonable to define, according
to (Jll;Q), a bi-normalized lexical distance as follows:

$$D_s(\alpha, \beta) = \frac{D(\alpha, \beta)}{\Gamma(\alpha, \beta)} \qquad (4)$$

This second normalization should cancel the effects of accidental
orthographical similarities between the two languages. Notice that while by definition
$D(\alpha, \alpha) = D_s(\alpha, \alpha) = 0$, in all real cases $\Gamma(\alpha, \alpha) \neq 0$.
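
Given precomputed matrices of lexical and global distances, the second normalization of equation (4) is an entry-wise division. The sketch below assumes full symmetric matrices stored as nested lists. The function name `bi_normalized` and the numerical values are invented for illustration and are not results of this paper.

```python
from typing import List

Matrix = List[List[float]]


def bi_normalized(lexical: Matrix, global_: Matrix) -> Matrix:
    """Equation (4): divide each lexical distance by the corresponding
    global distance. Assumes every global distance is non-zero."""
    n = len(lexical)
    return [[lexical[a][b] / global_[a][b] for b in range(n)]
            for a in range(n)]


# Invented toy values for two languages (illustration only).
D = [[0.0, 0.6],
     [0.6, 0.0]]
G = [[0.80, 0.90],
     [0.90, 0.85]]

D_s = bi_normalized(D, G)
print(D_s)
```

The diagonal stays at 0 because the lexical self-distance vanishes while the global self-distance does not.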

We would like to stress that the idea behind the proposed second
normalization turns out to be correct only if $\Gamma(\alpha, \beta)$ is uncorrelated with the lexical
distance between languages $\alpha$ and $\beta$. In this case, in fact, it carries no
information concerning their relationship. On the contrary, if it is positively
correlated with the distance between the two languages, one can conclude
that it contains some information that can be usefully exploited.



4. Comparison of different definitions 

In order to decide which definition is better to use, $D(\alpha, \beta)$ or $D_s(\alpha, \beta)$,
we have to see whether $\Gamma(\alpha, \beta)$ is positively correlated with these distances. If
it is not, we will decide to use $D_s(\alpha, \beta)$, since this eliminates errors due to
accidental similarities between vocabularies. On the contrary, if it is positively
correlated, we would conclude that $\Gamma(\alpha, \beta)$ carries some positive information
about the degree of similarity of the two languages. In this second case, two
languages will be, on average, closer for smaller $\Gamma(\alpha, \beta)$ and we would decide
to use $D(\alpha, \beta)$ since it incorporates the information contained in $\Gamma(\alpha, \beta)$.

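In order to compute the correlation between distance and $\Gamma(\alpha, \beta)$ we
proceed as follows: first we define, for a generic function $f(\alpha, \beta)$, the average
$\langle f \rangle$ over all possible values of $\alpha$ and $\beta$ as

$$\langle f \rangle = \frac{1}{N^2} \sum_{\alpha, \beta} f(\alpha, \beta) \qquad (5)$$

which is the average value of the function $f(\alpha, \beta)$ in a linguistic group. Then,
we define the correlation between $D(\alpha, \beta)$ and $\Gamma(\alpha, \beta)$ in a standard way as

$$C(\Gamma, D) = \frac{\langle (\Gamma - \langle \Gamma \rangle)(D - \langle D \rangle) \rangle}{\left( \langle (\Gamma - \langle \Gamma \rangle)^2 \rangle \, \langle (D - \langle D \rangle)^2 \rangle \right)^{1/2}} \qquad (6)$$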
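
The average over language pairs and the correlation coefficient defined above can be sketched in plain Python as follows. The sketch assumes that the average runs over all $N^2$ ordered pairs, self-pairs included, and that both matrices are stored as full symmetric nested lists. The function names and the numerical values are invented for illustration and are not the data of this paper.

```python
from math import sqrt
from typing import List

Matrix = List[List[float]]


def average(f: Matrix) -> float:
    """Average of f over all N*N ordered pairs of languages
    (assumption: self-pairs on the diagonal are included)."""
    n = len(f)
    return sum(f[a][b] for a in range(n) for b in range(n)) / n ** 2


def correlation(x: Matrix, y: Matrix) -> float:
    """Standard correlation coefficient between two pairwise quantities."""
    n = len(x)
    mx, my = average(x), average(y)
    cov = average([[(x[a][b] - mx) * (y[a][b] - my) for b in range(n)]
                   for a in range(n)])
    var_x = average([[(x[a][b] - mx) ** 2 for b in range(n)] for a in range(n)])
    var_y = average([[(y[a][b] - my) ** 2 for b in range(n)] for a in range(n)])
    return cov / sqrt(var_x * var_y)


# Invented toy matrices for three languages (illustration only).
D = [[0.0, 0.5, 0.8],
     [0.5, 0.0, 0.7],
     [0.8, 0.7, 0.0]]
G = [[0.80, 0.85, 0.90],
     [0.85, 0.82, 0.88],
     [0.90, 0.88, 0.81]]

print(correlation(G, D))
```

Dropping the diagonal terms would only require restricting the sums in `average` to pairs with different indices and dividing by $N(N-1)$ instead.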

The result is that the correlation in the Indo-European group is 0.59173
while in the Austronesian group it is 0.46032. In both cases this is a quite high
positive value (correlation may take any value between -1 and 1) and we
conclude that the vocabulary similarities accounted for by $\Gamma(\alpha, \beta)$ carry
information and are not at all accidental. The weak point is that we have
checked correlation against $D(\alpha, \beta)$ which, at least from the point of view
of the proponents of the second normalization, linearly incorporates $\Gamma(\alpha, \beta)$
since $D(\alpha, \beta) = \Gamma(\alpha, \beta) D_s(\alpha, \beta)$.

From this point of view our result is not so astonishing. Nevertheless, we
can also compute the correlation between the bi-normalized distance $D_s(\alpha, \beta)$
and $\Gamma(\alpha, \beta)$. The definition is the same as (6) with $D_s$ substituting $D$. We
obtain that the correlation $C(\Gamma, D_s)$ in the Indo-European group is 0.54713
while in the Austronesian group it is 0.40169. These two values, although slightly
smaller than the previous ones, are still quite high and confirm that $\Gamma(\alpha, \beta)$
contains positive information. In other words, closer languages, both in the
sense of a smaller $D(\alpha, \beta)$ and a smaller $D_s(\alpha, \beta)$, will have on average a smaller
$\Gamma(\alpha, \beta)$.

We remark that the same correlation coefficients, both for $D$ and $D_s$,
come out if the average (5) is computed neglecting the pairs where the same
Greek index is repeated.

In order to complete our analysis we plot, only for the Austronesian
language group, $\Gamma(\alpha, \beta)$ as a function of $D(\alpha, \beta)$ (Fig. 1 left) and as a function
of $D_s(\alpha, \beta)$ (Fig. 1 right). Each point in the figures represents a pair of
languages. In both cases we perceive the positive correlation, which is evidenced
by the best linear fits.

We remark that the points which lie on the vertical axes at the distance
value 0 correspond, in both figures, to pairs for which a language is
compared with itself. For these points $D(\alpha, \alpha) = D_s(\alpha, \alpha)$ are all equal to 0 while
the $\Gamma(\alpha, \alpha)$ are positive. It is easy to see that the self-distances
$\Gamma(\alpha, \alpha)$, which compare words with different meanings in the same
language, are, on average, smaller than the $\Gamma(\alpha, \beta)$ which compare words
with different meanings in two different languages. This fact confirms that
the information carried by $\Gamma(\alpha, \beta)$ is positive.

In other words, more closely related languages not only have more similar words
corresponding to the same meaning, but the general occurrence and ordering
of characters in their words is also more similar.

5. Conclusions 

In this work we have analyzed two different possibilities for the definition
of an automated language distance. More precisely, starting from a Levenshtein
distance, we have analyzed two possible normalizations. The choice between 
them is only made by using statistical arguments. 

Figure 1: Global distance $\Gamma(\alpha, \beta)$ versus lexical distance $D(\alpha, \beta)$ (left) and versus
bi-normalized distance $D_s(\alpha, \beta)$ (right) for Austronesian languages. The positive correlation
is evidenced by the best linear fits. The points which lie on the vertical axes at the
distance value 0 correspond to pairs for which a language is compared with itself. For these
points $D(\alpha, \alpha) = D_s(\alpha, \alpha) = 0$ while $\Gamma(\alpha, \alpha) \neq 0$.

Our conclusion is that it is preferable to use the single-normalization
definition of distance $D(\alpha, \beta)$, since otherwise part of the information about
affinities of languages is lost. In fact, our analysis shows that more closely related
languages have smaller global distance. This means that they not only have
more similar words for the same meaning, but also a more similar general
occurrence and ordering of characters in words.